Exploration of Assessment Neighborhoods in the District

The FIS team and I have begun to examine the role of transformative investments in the District. Our first pass at this analysis was presented at the 2013 Fall NTA Meeting, and we need to clean a few items up for a proceedings submission. This script is intended to provide some exploratory graphics to capture some economic context over the study period.

In [74]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pysal as ps
import geopandas as gp
from IPython.display import HTML
import seaborn as sb
from pandas.tools.plotting import scatter_matrix
import statsmodels.api as sm
from mpltools import color
import scipy

#Set print width
pd.set_option('line_width',100)

The data for this exercise will come from the annual summary stats by neighborhood compiled in the following files in '/home/choct155/ORA/EconDev':

  1. edev_mean.csv
  2. edev_median.csv

First, let's read in the data.

In [2]:
#Establish working directory
workdir='/home/choct155/ORA/EconDev/'

#Read in data
mean_temp=pd.read_csv(workdir+'edev_mean.csv',index_col=['aneigh','year']).sortlevel(0)
med_temp=pd.read_csv(workdir+'edev_median.csv',index_col=['aneigh','year']).sortlevel(0)

#Slice to exclude empty neighborhoods
mean=mean_temp.ix[('16th Street Heights',2001):('Woodridge',2011)]
med=med_temp.ix[('16th Street Heights',2001):('Woodridge',2011)]

#Combine sets
stats=mean.join(med[['50%_iit','50%_prop']])

stats.head()
Out[2]:
mean_iit mean_prop exp num_bbl 50%_iit 50%_prop
aneigh year
16th Street Heights 2001 37423.540696 185900.003636 NaN 40 25567.0 148394
2002 39058.201993 192702.633166 NaN 6 26094.0 150692
2003 39203.947657 194879.192164 NaN 15 26078.5 155063
2004 42839.623425 306471.427547 NaN 33 27727.0 252520
2005 43622.986597 399701.294067 NaN 70 28930.0 336380

5 rows × 6 columns

The above data must be augmented with spatial references. We can grab said references using the assessment neighborhoood shapefile.

In [3]:
#Read in spatial data
nbhd_temp=gp.GeoDataFrame.from_file(workdir+'as_nbhd.shp').rename(columns={'NEIGHBORHO':'aneigh'})

#Set neighborhoods as row names
nbhd=nbhd_temp.set_index('aneigh')
nbhd.head()
Out[3]:
DESCRIPTIO GIS_ID LABELB NBHD NBHD_NUM NBHD_TEXT NHDLNK OBJECTID SHAPE_AREA SHAPE_LEN geometry
aneigh
Georgetown 025 Georgetown nhood_025 None 025 25 25 025 32 0 0 POLYGON ((394168.0625003650784492 138547.57810...
Glover Park 026 Glover Park nhood_026 None 026 26 26 026 33 0 0 POLYGON ((393275.6563011109828949 139746.56249...
Hawthorne 027 Hawthorne nhood_027 None 027 27 27 027 34 0 0 POLYGON ((395475.5000040307641029 145682.45309...
Hillcrest 028 Hillcrest nhood_028 None 028 28 28 028 35 0 0 POLYGON ((403452.1249997988343239 134667.18750...
Kalorama 029 Kalorama nhood_029 None 029 29 29 029 36 0 0 POLYGON ((395912.3124985843896866 139486.20310...

5 rows × 11 columns

Now we need to merge this information together. To make this happen, we will need to split the economic data by year, and merge it with the spatial information. All told, we will create 11 GeoDataFrames. Since we are walking before running here, let's test the 2001 case.

In [4]:
#Join 2001 economic data with spatial references
nbhd01=gp.GeoDataFrame(nbhd.join(stats.xs(2001,level='year')))
nbhd01.head()
Out[4]:
DESCRIPTIO GIS_ID LABELB NBHD NBHD_NUM NBHD_TEXT NHDLNK OBJECTID SHAPE_AREA SHAPE_LEN geometry mean_iit mean_prop exp num_bbl 50%_iit 50%_prop
aneigh
16th Street Heights 049 16th Street Heights nhood_049 None 049 49 49 049 19 0 0 POLYGON ((397753.9375005662441254 141723.29690... 37423.540696 185900.003636 NaN 40 25567 148394.0
American University 001 American University nhood_001 None 001 1 1 001 4 0 0 POLYGON ((393362.3124991357326508 141606.35940... 104302.134924 343016.255255 NaN 33 68921 295183.0
Anacostia 002 Anacostia nhood_002 None 002 2 2 002 5 0 0 POLYGON ((402129.8438016176223755 134197.10940... 22235.685076 89607.184932 NaN 82 18286 79293.5
Anacostia Park 064 Anacostia Park nhood_064 None 064 64 64 064 59 0 0 POLYGON ((402293.9999981075525284 134723.28129... NaN NaN NaN NaN NaN NaN
Barry Farms 003 Barry Farms nhood_003 None 003 3 3 003 6 0 0 POLYGON ((399873.1874952167272568 133147.24999... 18984.787792 140198.988095 NaN 2 16373 62936.5

5 rows × 17 columns

Does it plot?

In [5]:
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=3, colormap='OrRd')
/home/choct155/analysis/Anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1236: UserWarning: findfont: Font family ['serif'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Out[5]:
<matplotlib.axes.AxesSubplot at 0x2fe9dd0>

Doesn't look quite right. I am wondering if it's an issue with GeoPandas or the data. Do pre-merge values plot appropriately?

In [6]:
nbhd.plot(column='NBHD_NUM', scheme='QUANTILES', k=3, colormap='OrRd')
Out[6]:
<matplotlib.axes.AxesSubplot at 0x5c52e90>

They appear to do so. What do the underlying income data look like in 2001?

In [81]:
#Plot distribution
fig,axes=plt.subplots(figsize=(16,6))

#Plot average AGI distribution
sb.kdeplot(nbhd01['mean_iit'][nbhd01['mean_iit'].notnull()],ax=axes,shade=True)
sb.rugplot(nbhd01['mean_iit'][nbhd01['mean_iit'].notnull()],ax=axes,color='#999999')

#Set title
axes.set_title('Average AGI by Assessment Neighborhood (2001)',fontsize=16)

#Set background color
axes.patch.set_facecolor('white')

The plot above captures the kernel density estimate of the mean_iit distribution. Nothing looks all that strange. The spacing looks a little funny because of the relatively low number of observations (maximum of 70 per year).

Perhaps it has to do with the number of discrete categories. Let's change that up.

In [8]:
plt.rcParams['figure.figsize']=18,20
fig,axes=plt.subplots(2,2,sharex=True,sharey=True)
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=3, colormap='OrRd',axes=axes[0,0])
axes[0,0].set_title('k=3')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=4, colormap='OrRd',axes=axes[0,1])
axes[0,1].set_title('k=4')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='OrRd',axes=axes[1,0])
axes[1,0].set_title('k=5')
nbhd01.plot(column='mean_iit', scheme='QUANTILES', k=6, colormap='OrRd',axes=axes[1,1])
axes[1,1].set_title('k=6')
plt.suptitle('Distribution of mean_iit by Quantile Bin Count',fontsize=20)
Out[8]:
<matplotlib.text.Text at 0x741f090>

I believe we have identified the source of the problem. We should be specifying quintiles with k=5. It would also be useful to see what the different classification schemes offer. In addition to 'QUANTILES', the following classification rules can be employed:

In [9]:
#Set plot size
plt.rcParams['figure.figsize']=18,9

#Create plotting object
fig,axes=plt.subplots(1,2,sharex=True,sharey=True)

#Plot Fisher-Jenks and change bg
nbhd01.plot(column='mean_iit', scheme='fisher_jenks', k=8, colormap='OrRd',axes=axes[0])
axes[0].set_title('Fisher-Jenks')
axes[0].patch.set_facecolor('white')

#Plot Equal Interval and change bg
nbhd01.plot(column='mean_iit', scheme='equal_interval', k=8, colormap='OrRd',axes=axes[1])
axes[1].set_title('Equal Interval')
axes[1].patch.set_facecolor('white')

It looks like extreme values on the high-end are probably disrupting things. For Fisher-Jenks, the scheme seeks to minimize variance within groups while maximizing variance across groups. I suspect to many observations on the low end fall into a single category in an effort to accommodate the very high variance among the highest values. For Equal Interval, the gap between the highest values and more typical values is probably consuming multiple groups. The Quantile approach looks like a winner.

Spatio-Temporal Plotting

Now we have an approach that can be used to examine these distribution plots over time. The first order of business is generating GeoDataFrames for each year.

In [10]:
#Create container for GDFs
nbhd_gdf_list=[]

#Create container for years
yr_list=[]

#For each year in 2001-2011...
for yr in range(2001,2012):
    #...grab the year...
    yr_list.append(yr)
    #...and the newly merged GDF
    nbhd_gdf_list.append(gp.GeoDataFrame(nbhd.join(stats.xs(yr,level='year'))))
    
#Capture all GDFs in a dictionary
nbhd_dict=dict(zip(yr_list,nbhd_gdf_list))

nbhd_dict[2001].head()
Out[10]:
DESCRIPTIO GIS_ID LABELB NBHD NBHD_NUM NBHD_TEXT NHDLNK OBJECTID SHAPE_AREA SHAPE_LEN geometry mean_iit mean_prop exp num_bbl 50%_iit 50%_prop
aneigh
16th Street Heights 049 16th Street Heights nhood_049 None 049 49 49 049 19 0 0 POLYGON ((397753.9375005662441254 141723.29690... 37423.540696 185900.003636 NaN 40 25567 148394.0
American University 001 American University nhood_001 None 001 1 1 001 4 0 0 POLYGON ((393362.3124991357326508 141606.35940... 104302.134924 343016.255255 NaN 33 68921 295183.0
Anacostia 002 Anacostia nhood_002 None 002 2 2 002 5 0 0 POLYGON ((402129.8438016176223755 134197.10940... 22235.685076 89607.184932 NaN 82 18286 79293.5
Anacostia Park 064 Anacostia Park nhood_064 None 064 64 64 064 59 0 0 POLYGON ((402293.9999981075525284 134723.28129... NaN NaN NaN NaN NaN NaN
Barry Farms 003 Barry Farms nhood_003 None 003 3 3 003 6 0 0 POLYGON ((399873.1874952167272568 133147.24999... 18984.787792 140198.988095 NaN 2 16373 62936.5

5 rows × 17 columns

Now we can plot the quintile distribution over time. It would be useful to directly compare the income (adjusted gross income) and property (assessed value) data. It's difficult to see annual trends, so we will subset to three years: [2001,2006,2011]. It would also be useful to capture the difference over the 2001-2011 period in one plot.

In [11]:
#Calculate difference across the time period
dstats=stats.xs(2011,level='year')-stats.xs(2001,level='year')

#Create new GDF
nbhd_diff=gp.GeoDataFrame(nbhd.join(dstats))

nbhd_diff
Out[11]:
DESCRIPTIO GIS_ID LABELB NBHD NBHD_NUM NBHD_TEXT NHDLNK OBJECTID SHAPE_AREA SHAPE_LEN geometry mean_iit mean_prop exp num_bbl 50%_iit 50%_prop
aneigh
16th Street Heights 049 16th Street Heights nhood_049 None 049 49 49 049 19 0 0 POLYGON ((397753.9375005662441254 141723.29690... 10151.815607 388336.458667 NaN 1 4268.0 332076.000000
American University 001 American University nhood_001 None 001 1 1 001 4 0 0 POLYGON ((393362.3124991357326508 141606.35940... 48946.910969 702888.630979 NaN 9 19946.5 490262.000000
Anacostia 002 Anacostia nhood_002 None 002 2 2 002 5 0 0 POLYGON ((402129.8438016176223755 134197.10940... 6660.024686 242204.491070 NaN 5 4907.0 152326.500000
Anacostia Park 064 Anacostia Park nhood_064 None 064 64 64 064 59 0 0 POLYGON ((402293.9999981075525284 134723.28129... NaN NaN NaN NaN NaN NaN
Barry Farms 003 Barry Farms nhood_003 None 003 3 3 003 6 0 0 POLYGON ((399873.1874952167272568 133147.24999... 9978.098752 307822.957557 NaN 28 5527.0 131568.500000
Berkley 004 Berkley nhood_004 None 004 4 4 004 7 0 0 POLYGON ((392165.6250010132789612 139856.43749... 164098.758926 995437.929621 NaN 3 50297.5 742387.000000
Bolling Air Force Base 068 Bolling Air Force Base nhood_068 None 068 68 68 068 62 0 0 POLYGON ((401009.8124961480498314 133673.04690... -1868.991438 NaN NaN NaN -3565.0 NaN
Brentwood 005 Brentwood nhood_005 None 005 5 5 005 8 0 0 POLYGON ((401884.4374966025352478 138898.84380... 7438.344301 1110490.656648 2400000 7 7485.5 184072.000000
Brightwood 006 Brightwood nhood_006 None 006 6 6 006 9 0 0 POLYGON ((398606.3124998658895493 144936.40629... 8061.447121 322120.657318 NaN 45 2842.5 232805.500000
Brookland 007 Brookland nhood_007 None 007 7 7 007 10 0 0 POLYGON ((400027.9687028750777245 142992.17189... 12210.956033 666751.514142 NaN 44 7130.0 208699.000000
Burleith 008 Burleith nhood_008 None 008 8 8 008 11 0 0 POLYGON ((393436.2690014690160751 138835.49999... 69729.518633 702932.305543 NaN NaN -1446.0 503108.500000
Capitol Hill 009 Capitol Hill nhood_009 None 009 9 9 009 12 0 0 POLYGON ((399695.4374968484044075 136326.67190... 31792.605978 1058918.104868 NaN 83 18021.5 464412.500000
Central-tri 1 010b Central-tri 1 nhood_010b b 010b 10 10b 010 14 0 0 POLYGON ((396116.5312031209468842 138027.82810... -58227.694720 NaN NaN 229 20932.5 NaN
Central-tri 3 010a Central-tri 3 nhood_010a a 010a 10 10a 010 13 0 0 POLYGON ((397086.3438006937503815 137596.21870... 29201.353576 3879304.458221 10876855 203 20773.0 293410.000000
Chevy Chase 011 Chevy Chase nhood_011 None 011 11 11 011 15 0 0 POLYGON ((394655.4374978095293045 145649.68750... 47862.543371 632182.740512 NaN 48 23040.5 480577.000000
Chillum 012 Chillum nhood_012 None 012 12 12 012 17 0 0 POLYGON ((398634.6874989196658134 144820.20309... 20582.451995 294516.475790 NaN 2 12054.5 229368.000000
Cleveland Park 013 Cleveland Park nhood_013 None 013 13 13 013 18 0 0 POLYGON ((393218.2187999263405800 141593.68750... 73818.875090 679051.354777 NaN 92 23070.0 259520.000000
Colonial Village 014 Colonial Village nhood_014 None 014 14 14 014 1 0 0 POLYGON ((396911.6563012674450874 146971.51559... 24208.040453 1812437.827012 NaN 11 15016.5 384807.500000
Columbia Heights 015 Columbia Heights nhood_015 None 015 15 15 015 16 0 0 POLYGON ((397162.7813012897968292 141562.84370... 19383.194682 451491.683001 7951050 138 10183.5 275373.500000
Congress Heights 016 Congress Heights nhood_105 None 016 16 16 016 71 0 0 POLYGON ((399985.0720029100775719 131233.67849... 4187.924837 372078.749701 NaN 106 3266.0 129361.500000
Crestwood 017 Crestwood nhood_017 None 017 17 17 017 2 0 0 POLYGON ((396841.6249987483024597 141435.96880... 21995.671585 503810.410199 NaN 2 4610.5 444070.000000
DC Stadium Area 063 DC Stadium Area nhood_063 None 063 63 63 063 58 0 0 POLYGON ((405088.1875019520521164 138918.29689... -1046.250000 NaN NaN NaN 285.0 NaN
DC Village 069 DC Village nhood_069 None 069 69 69 069 63 0 0 (POLYGON ((398565.5624995529651642 128479.9922... 23600.777778 1797443.666667 NaN 1 27656.0 78077.000000
Deanwood 018 Deanwood nhood_018 None 018 18 18 018 3 0 0 POLYGON ((406289.7813053801655769 137716.20310... 5291.613271 180946.135372 7247447 52 3387.0 147899.000000
Eckington 019 Eckington nhood_019 None 019 19 19 019 26 0 0 POLYGON ((400260.8125038370490074 139138.40630... 20814.307764 366703.343489 NaN 22 16719.5 246439.000000
Foggy Bottom 020 Foggy Bottom nhood_020 None 020 20 20 020 27 0 0 POLYGON ((395376.5312048494815826 137310.29689... -16491.399838 2754654.369355 NaN 71 3806.0 209299.000000
Forest Hills 021 Forest Hills nhood_021 None 021 21 21 021 28 0 0 POLYGON ((395207.2188001498579979 142082.34370... -1093.373087 689750.187364 NaN 35 11030.0 204745.000000
Fort Drive 070 Fort Drive nhood_102 None 070 70 70 070 68 0 0 POLYGON ((391960.6874961256980896 140329.85939... NaN NaN NaN NaN NaN NaN
Fort Dupont Park 022 Fort Dupont Park nhood_022 None 022 22 22 022 29 0 0 POLYGON ((403489.9374958276748657 135773.31250... 7282.945077 226305.942636 632143 32 3585.5 143253.000000
Fort Lincoln 066 Fort Lincoln nhood_066 None 066 66 66 066 61 0 0 POLYGON ((403678.1874998435378075 140328.85939... 18066.389539 NaN NaN NaN 13142.0 NaN
Foxhall 023 Foxhall nhood_023 None 023 23 23 023 30 0 0 POLYGON ((392935.0311989337205887 138281.51560... 39602.033240 510498.259921 NaN NaN 4008.5 479230.000000
Ft. McNair 074 Ft. McNair nhood_074 None 074 74 74 074 66 0 0 POLYGON ((398663.3437953889369965 133801.34379... -5480.500000 NaN NaN NaN -8830.0 NaN
Garfield 024 Garfield nhood_024 None 024 24 24 024 31 0 0 POLYGON ((395059.1874988898634911 140522.84380... 30017.093323 52565.354167 NaN 28 13848.0 334600.000000
Georgetown 025 Georgetown nhood_025 None 025 25 25 025 32 0 0 POLYGON ((394168.0625003650784492 138547.57810... 25122.189981 935925.263105 NaN 27 21593.5 555034.000000
Glover - Archbold Parkway 071 Glover - Archbold Parkway nhood_103 None 071 71 71 071 69 0 0 POLYGON ((392934.8124960809946060 139809.53129... NaN NaN NaN NaN NaN NaN
Glover Park 026 Glover Park nhood_026 None 026 26 26 026 33 0 0 POLYGON ((393275.6563011109828949 139746.56249... 19509.988590 405261.850330 NaN 28 14259.5 224460.000000
Hawthorne 027 Hawthorne nhood_027 None 027 27 27 027 34 0 0 POLYGON ((395475.5000040307641029 145682.45309... 86721.234174 474825.417684 NaN -3 18287.0 445453.000000
Hillcrest 028 Hillcrest nhood_028 None 028 28 28 028 35 0 0 POLYGON ((403452.1249997988343239 134667.18750... 13396.165866 227305.864652 13772750 35 9231.5 196861.000000
Kalorama 029 Kalorama nhood_029 None 029 29 29 029 36 0 0 POLYGON ((395912.3124985843896866 139486.20310... 24763.661029 846810.090518 NaN 65 15057.0 319820.000000
Kent 030 Kent nhood_030 None 030 30 30 030 37 0 0 POLYGON ((390366.3749948143959045 140878.43749... 240163.120126 729490.176600 NaN 9 17280.0 625763.000000
Ledroit Park 031 Ledroit Park nhood_031 None 031 31 31 031 38 0 0 POLYGON ((399215.7813035175204277 139388.92189... 22923.025812 521233.593381 NaN 24 18352.5 277809.000000
Lily Ponds 032 Lily Ponds nhood_032 None 032 32 32 032 39 0 0 POLYGON ((405441.8750004172325134 138383.34379... 5712.345324 353777.098540 NaN 25 4579.0 140672.000000
Mall/East Potomac Park 072 Mall/East Potomac Park nhood_072 None 072 72 72 072 64 0 0 POLYGON ((396577.0000022724270821 136774.32809... 33153.000000 NaN NaN -1 33153.0 NaN
Marshall Heights 033 Marshall Heights nhood_033 None 033 33 33 033 46 0 0 POLYGON ((405649.9999989196658134 135771.01560... 6159.992912 217203.523967 NaN 13 3552.5 131486.000000
Massachusetts Avenue Heights 034 Massachusetts Avenue Heights nhood_034 None 034 34 34 034 47 0 0 POLYGON ((394500.8438037559390068 139977.15620... 5009.530000 4248743.894029 NaN 3 320349.0 1509370.000000
Michigan Park 035 Michigan Park nhood_035 None 035 35 35 035 48 0 0 POLYGON ((401945.9686973169445992 142061.78120... 11929.801971 262366.340877 NaN 14 5711.0 211698.000000
Mt. Pleasant 036 Mt. Pleasant nhood_036 None 036 36 36 036 40 0 0 POLYGON ((396839.8749978095293045 140943.95309... 22371.275978 550802.817562 NaN 72 14175.0 376095.000000
National Arboretum 065 National Arboretum nhood_065 None 065 65 65 065 60 0 0 POLYGON ((404176.5625032037496567 139040.87500... NaN NaN NaN NaN NaN NaN
National Zoological Park 061 National Zoological Park nhood_101 None 061 61 61 061 67 0 0 POLYGON ((395538.5311947539448738 140655.71869... NaN NaN NaN NaN NaN NaN
North Cleveland Park 037 North Cleveland Park nhood_037 None 037 37 37 037 50 0 0 POLYGON ((393193.3436947166919708 142225.64059... 38400.645485 806667.861538 NaN 14 20304.0 465198.500000
Observatory Circle 038 Observatory Circle nhood_038 None 038 38 38 038 51 0 0 POLYGON ((393860.9062053486704826 139963.15620... 39501.974584 1081260.913584 NaN 28 16261.0 357959.000000
Old City 1 039 Old City 1 nhood_039 None 039 39 39 039 41 0 0 POLYGON ((401407.6250021383166313 136917.12499... 36529.954832 626074.550251 38456297 300 22321.0 303870.500000
Old City 2 040 Old City 2 nhood_040 None 040 40 40 040 42 0 0 POLYGON ((397846.5000044405460358 139029.73440... 34780.293051 597281.919948 -116681412 295 22820.0 305360.000000
Palisades 041 Palisades nhood_041 None 041 41 41 041 43 0 0 POLYGON ((390085.2812975496053696 140875.01560... 56428.322562 644882.269481 NaN 6 15612.0 431789.500000
Petworth 042 Petworth nhood_042 None 042 42 42 042 44 0 0 POLYGON ((399220.5624953210353851 142002.45309... 13394.969827 262862.006245 NaN 69 5414.5 237621.000000
R.L.A. (N.E.) 044 R.L.A. (N.E.) nhood_044 None 044 44 44 044 49 0 0 POLYGON ((399380.7187976986169815 136793.06249... 32975.297872 10629753.914894 -16853565 15 36531.0 445635.500000
R.L.A. (S.W.) 046 R.L.A. (S.W.) nhood_046 None 046 46 46 046 52 0 0 POLYGON ((398930.1249984651803970 135657.31250... 21499.967069 2684381.574585 NaN 64 15690.0 158175.666667
Randle Heights 043 Randle Heights nhood_043 None 043 43 43 043 45 0 0 POLYGON ((402521.8749964311718941 133517.04689... 5142.774570 352277.200037 NaN 31 3015.5 162908.500000
Riggs Park 047 Riggs Park nhood_047 None 047 47 47 047 53 0 0 POLYGON ((399834.9686963036656380 144114.68749... 5357.155275 210957.685658 NaN 13 5025.0 165331.000000
Rock Creek Park 060 Rock Creek Park nhood_060 None 060 60 60 060 56 0 0 POLYGON ((396190.3124968558549881 146329.64060... NaN 1828254.090909 NaN 0 NaN -232105.681818
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

71 rows × 17 columns

With the difference calculated, we can move ahead to plotting.

In [12]:
nbhd_dict[2006][nbhd_dict[2006]['mean_iit'].isnull()]['mean_iit']
Out[12]:
aneigh
Anacostia Park             NaN
Fort Drive                 NaN
National Arboretum         NaN
National Zoological Park   NaN
Name: mean_iit, dtype: float64
In [13]:
#Set plot size
plt.rcParams['figure.figsize']=18,32

#Update nbhd_dict to include the difference information
nbhd_dict['diff']=nbhd_diff

#Create list of year subset
yr_sub=[2001,2006,2011,'diff']

#Create plotting object
fig,axes=plt.subplots(len(yr_sub),2)

#For each year in 2001-2011...
for i in range(len(yr_sub)):
    #...plot the distribution of mean_iit...
    nbhd_dict[yr_sub[i]].plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i,0])
    axes[i,0].set_title('Average AGI in '+str(yr_sub[i]))
    axes[i,0].patch.set_facecolor('white')
    axes[i,0].set_aspect(1)
    axes[i,0].set_axis_off()
    #...and the distribution of mean_prop
    nbhd_dict[yr_sub[i]].plot(column='mean_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i,1])
    axes[i,1].set_title('Average Assessment Value in '+str(yr_sub[i]))
    axes[i,1].patch.set_facecolor('white')
    axes[i,1].set_aspect(1)
    axes[i,1].set_axis_off()

#Remove unused space
plt.tight_layout()

#Adjust titles on last two plots
axes[3,0].set_title('Average AGI Difference (2001-2011)')
axes[3,1].set_title('Average Assessment Value Difference (2001-2011)')

#Set image location on disk
img_dir='/home/choct155/ORA/EconDev/'

plt.savefig(img_dir+'AvgTaxBase_nbhd.png')

The general eastward trend is expected. It is interesting to note, however, that most of the income growth has been captured in the middle of the city, while property growth has been greater in the more residential areas of Ward 5 and surrounding areas.

What about the median values?

In [14]:
#Create plotting object
fig,axes=plt.subplots(len(yr_sub),2)

#For each year in 2001-2011...
for i in range(len(yr_sub)):
    #...plot the distribution of mean_iit...
    nbhd_dict[yr_sub[i]].plot(column='50%_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i,0])
    axes[i,0].set_title('Median AGI in '+str(yr_sub[i]))
    axes[i,0].patch.set_facecolor('white')
    axes[i,0].set_aspect(1)
    axes[i,0].set_axis_off()
    #...and the distribution of mean_prop
    nbhd_dict[yr_sub[i]].plot(column='50%_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i,1])
    axes[i,1].set_title('Median Assessment Value in '+str(yr_sub[i]))
    axes[i,1].patch.set_facecolor('white')
    axes[i,1].set_aspect(1)
    axes[i,1].set_axis_off()

#Remove unused space
plt.tight_layout()

#Adjust titles on last two plots
axes[3,0].set_title('Median AGI Difference (2001-2011)')
axes[3,1].set_title('Median Assessment Value Difference (2001-2011)')

plt.savefig(img_dir+'MedianTaxBase_nbhd.png')

While average assessment values have grown in the 'Near Northeast', the median values do not display the same action. Median values did have noticeable growth along the NW/NE border, which reinforced the eastward shift. On the other hand, median incomes values cluster in the center of the city just like mean values.

What about business license activity? The autoregressive nature of this figure is reduced because it captures license growth (unlike the stock characteristics captured by adjusted gross income and assessment value). Consequently, we will look at each year over the 2001-2011 period.

In [15]:
#Set plot size
plt.rcParams['figure.figsize']=18,25

#Create plotting object
fig,axes=plt.subplots(4,3)

#For each year in 2001-2011...
for i in range(len(yr_list)):
    #...plot the distribution of mean_iit...
    nbhd_dict[yr_list[i]].plot(column='num_bbl', scheme='QUANTILES', k=5, colormap='Purples',axes=axes[i//3,i%3],alpha=.5)
    #...set face color to white and...
    axes[i//3,i%3].patch.set_facecolor('white')
    #...set titles
    axes[i//3,i%3].set_title(str(yr_list[i]))
    #...turn off those pesky axes...
    axes[i//3,i%3].set_axis_off()
    #...and set the aspect ratio
    axes[i//3,i%3].set_aspect(1)
    
#Set super title
plt.suptitle('Distribution of Basic Business Licenses',fontsize=22)
plt.subplots_adjust(top=1.85)

plt.tight_layout()
plt.axis('off')

plt.savefig(img_dir+'BBL_nbhd.png')

Interesting, the patterns for business development do not appear to shift much over time at first blush.

Which areas have received the greatest amount of investment?

In [16]:
#Sum exp by neighborhood
exp_tot=stats['exp'].groupby(level='aneigh').sum()

#Join with GDF
nbhd['exp_tot']=exp_tot.ix[nbhd.index]

#Plot investment over the 2001-2011 time period
plt.rcParams['figure.figsize']=15,15
fig=plt.figure()
ax = fig.add_subplot(111)
ax.patch.set_facecolor('white')
nbhd.fillna(0).plot(column='exp_tot', scheme='Quantiles', k=9, colormap='Greens',axes=ax)
ax.set_title('Distribution of Total Expenditure by Neighborhood (2001-2011)')
ax.set_aspect(1)
ax.set_axis_off()

plt.savefig(img_dir+'TransInvest_nbhd.png')

Exploring the Growth Environment

Now that we have an idea of the dynamics of value levels, it would be interesting to explore the growth trajectory of these neighborhoods. Specifically, we want to know the following:

  1. What is the relative rate of growth across neighborhoods?
  2. Do we notice differences in growth rates by 'era'?

The latter seeks to understand shifts in growth profiles. For our purposes, 'era' refers to one of three collections of years:

  1. 2002-2004: captures the .com/9-11 effect
  2. 2005-2007: captures the back end of the .com recession/9-11
  3. 2008-2011: captures the Great Recession and recovery period

Obviously these 'eras' are not easily separated. The idea is just to provide some allowance for differing growth regimes. To construct them, we will fold the stats DF into the three periods (aggregating via mean), and merge them in with the nbhd GeoDataFrame.

In [17]:
#Create copy of stats, but capturing percentage change
stats_era=stats.pct_change().reset_index().copy()

#Create mapping dictionary for eras
era_map_dict=dict({2001:0,
               2002:0,
               2003:0,
               2004:0,
               2005:1,
               2006:1,
               2007:1,
               2008:2,
               2009:2,
               2010:2,
               2011:2})

#Recode years to eras
stats_era['era']=stats_era['year'].map(era_map_dict)

#Groupby neighborhood and era, aggregating by mean
s_era=stats_era.groupby([stats_era['aneigh'],stats_era['era']]).mean()

#Create container for new GDFs
era_gdf_list=[]

#Create container for era labels
era_list=[]

#For each era...
for era in s_era.index.get_level_values(level='era'):
    #...capture the era...
    era_list.append(era)
    #...and capture a new era-specific GDF...
    era_gdf_list.append(gp.GeoDataFrame(nbhd.join(s_era.xs(era,level='era'))))

#Capture new era-specific GDFs in a dictionary
era_dict=dict(zip(era_list,era_gdf_list))

era_dict[1].head()
Out[17]:
DESCRIPTIO GIS_ID LABELB NBHD NBHD_NUM NBHD_TEXT NHDLNK OBJECTID SHAPE_AREA SHAPE_LEN geometry exp_tot year mean_iit mean_prop exp num_bbl 50%_iit 50%_prop era
aneigh
16th Street Heights 049 16th Street Heights nhood_049 None 049 49 49 049 19 0 0 POLYGON ((397753.9375005662441254 141723.29690... NaN 2006 0.034543 0.205481 NaN 0.343482 0.020916 0.238907 1
American University 001 American University nhood_001 None 001 1 1 001 4 0 0 POLYGON ((393362.3124991357326508 141606.35940... NaN 2006 0.045758 0.138915 NaN -0.022304 0.025021 0.135439 1
Anacostia 002 Anacostia nhood_002 None 002 2 2 002 5 0 0 POLYGON ((402129.8438016176223755 134197.10940... NaN 2006 0.021353 0.244339 NaN -0.135516 0.019955 0.207580 1
Anacostia Park 064 Anacostia Park nhood_064 None 064 64 64 064 59 0 0 POLYGON ((402293.9999981075525284 134723.28129... NaN 2006 NaN -0.078672 NaN NaN NaN -0.076647 1
Barry Farms 003 Barry Farms nhood_003 None 003 3 3 003 6 0 0 POLYGON ((399873.1874952167272568 133147.24999... NaN 2006 0.041114 0.184210 NaN 3.942210 0.019017 0.164324 1

5 rows × 20 columns

It would also be usefult to know the economic potential of these neighborhoods to some extent. We can do this by assigning each neighborhood a quintile ranking pegged to the 2001 mean_iit value via a dictionary map from the neighborhood values in the index. (We could, of course, choose a different value for this filter if desired.) Note that we will want to color by quintile to add an informational dimension to the plots. We will, therefore, include the set of colors used directly into our newly augmented GDFs.

UPDATE (2/27/14): In an effort to generate consistent color schemes for the same process, two color parameters are added (one for AGI and one for assessment value). The colors vary linearly with the value of average AGI in 2001, but the data are extreme coded at 5% and 95%.

In [18]:
#Subset to 2001
stat01=stats.xs(2001,level='year')

#Capture quintile groupings
stat01['q_econ']=pd.qcut(stats.xs(2001,level='year')['mean_iit'],5)

#Generate dictionary mapping quintile groupings to clearer labels
q_dict=dict({'[4376.277, 24504.371]':'q1',
             '(24504.371, 35631.989]':'q2',
             '(35631.989, 61488.399]':'q3',
             '(61488.399, 109608.369]':'q4',
             '(109608.369, 916785.75]':'q5'})

#Generate a dictionary mapping colors to quintiles
qcolor_dict=dict({'q1':'#EDF8E9',
                  'q2':'#BAE4B3',
                  'q3':'#74C476',
                  'q4':'#31A354',
                  'q5':'#006D2C'})
             
#Assign new quintile labels
stat01['q_start']=stat01['q_econ'].map(q_dict)

#Assign colors
stat01['colors']=stat01['q_start'].map(qcolor_dict)

#Truncate 2001 average AGI (for color display)
##Set bounds
ceiling=scipy.stats.scoreatpercentile(stat01['mean_iit'],95)
floor=scipy.stats.scoreatpercentile(stat01['mean_iit'],5)
##Create container for new values
inc_list=[]
##Recode
for iit in stat01['mean_iit']:
    inc=None
    if iit > ceiling:
        inc=ceiling
    elif iit < floor:
        inc=floor
    else:
        inc=iit
    inc_list.append(inc)

#Generate 2001 version of mean_iit
stat01['mean_iit2001']=Series(inc_list,index=stat01.index)

#Generate new color maps, specific to AGI and assessment value
agi_col_map=color.color_mapper((floor,ceiling),cmap='Blues',start=.2)
aval_col_map=color.color_mapper((floor,ceiling),cmap='Reds',start=.2)

#Generate colors based upon truncated mean_iit
stat01['agi_col']=stat01['mean_iit2001'].apply(lambda x: agi_col_map(x))
stat01['aval_col']=stat01['mean_iit2001'].apply(lambda x: aval_col_map(x))

#Define new function to extreme code indicators of interest
def extreme(x):
    '''Caps percentage change on [-2,2] interval'''
    #Create container for new values
    x_new_list=[]
    
    #For each value...
    for i in range(len(x)):
        #...if less than -2 cap at -2...
        if x[i]<-2:
            x_new_list.append(-2.0)
        #...if between -2 and 2 leave as is...
        elif (x[i]>-2) & (x[i]<=2):
            x_new_list.append(x[i])
        #...otherwise cap at 2
        elif x[i]>2:
            x_new_list.append(2)
        else:
            x_new_list.append(0)
    return x_new_list

#Create new container for augmented era GDFs
era_list2=[]

#Define desired columns
era_cols=['q_start','colors','agi_col','aval_col']

#For df in era_dict...
for df in era_dict.values():
    #...add in q_start...
    df_new=pd.merge(df,stat01[era_cols],how='left',left_index=True,right_index=True)
    #...add in extreme coded values...
    df_new['mean_iit_ec']=extreme(df_new['mean_iit'])
    df_new['50%_iit_ec']=extreme(df_new['50%_iit'])
    df_new['mean_prop_ec']=extreme(df_new['mean_prop'])
    df_new['50%_prop_ec']=extreme(df_new['50%_prop'])
    #...add in colors...
    
    #...and throw it in the list
    era_list2.append(df_new)

#Create new dict
era_dict2=dict(zip([1,2,3],era_list2))
    
era_dict2[1].head().T
Out[18]:
aneigh 16th Street Heights American University Anacostia Anacostia Park Barry Farms
DESCRIPTIO 049 16th Street Heights 001 American University 002 Anacostia 064 Anacostia Park 003 Barry Farms
GIS_ID nhood_049 nhood_001 nhood_002 nhood_064 nhood_003
LABELB None None None None None
NBHD 049 001 002 064 003
NBHD_NUM 49 1 2 64 3
NBHD_TEXT 49 1 2 64 3
NHDLNK 049 001 002 064 003
OBJECTID 19 4 5 59 6
SHAPE_AREA 0 0 0 0 0
SHAPE_LEN 0 0 0 0 0
geometry POLYGON ((397753.9375005662441254 141723.29690... POLYGON ((393362.3124991357326508 141606.35940... POLYGON ((402129.8438016176223755 134197.10940... POLYGON ((402293.9999981075525284 134723.28129... POLYGON ((399873.1874952167272568 133147.24999...
exp_tot NaN NaN NaN NaN NaN
year 2002.5 2002.5 2002.5 2004 2002.5
mean_iit 0.04671634 0.3501658 -0.1844525 NaN -0.05471703
mean_prop 0.2068368 0.1055571 -0.1410949 24.87695 -0.08580156
exp NaN NaN NaN NaN NaN
num_bbl 0.6166667 1.476732 0.1241586 NaN 1.715476
50%_iit 0.02774383 0.3575031 -0.1696829 NaN -0.03929195
50%_prop 0.2243305 0.07886029 -0.1506306 36.07052 -0.119842
era 0 0 0 0 0
q_start q3 q4 q1 NaN q1
colors #74C476 #31A354 #EDF8E9 NaN #EDF8E9
agi_col (0.706343731459, 0.829019618034, 0.91271050116... (0.326289898858, 0.618623629037, 0.80279893524... (0.778685134299, 0.860299892987, 0.93799308328... NaN (0.790495975579, 0.868173787173, 0.94193003037...
aval_col (0.988235294819, 0.661453307376, 0.54897348530... (0.957001155965, 0.308712036586, 0.22191465613... (0.988419839214, 0.736747420535, 0.63589390866... NaN (0.989404075987, 0.754955800842, 0.66000770961...
mean_iit_ec 0.04671634 0.3501658 -0.1844525 0 -0.05471703
50%_iit_ec 0.02774383 0.3575031 -0.1696829 0 -0.03929195
mean_prop_ec 0.2068368 0.1055571 -0.1410949 2 -0.08580156
50%_prop_ec 0.2243305 0.07886029 -0.1506306 2 -0.119842

28 rows × 5 columns

First of all, what were the growth rankings in each of the eras? We can look at each era in turn, comparing adjusted gross income with assessment value.

In [19]:
def growth_compare(i,mean_med):
    #Generate plot space
    fig,axes=plt.subplots(1,2,sharey=True)
    #Create new GDF that excludes empty colors
    df_temp=era_dict2[i][era_dict2[i]['colors'].notnull()].sort(columns='mean_iit')
    #Identify color values
    colors=list(df_temp['colors'].values)
    #Plot AGI and assessment value
    df_temp[mean_med+'_iit_ec'].plot(kind='barh',ax=axes[0],color=list(df_temp['agi_col'].values),title='AGI Growth')
    axes[0].set_ylabel('Assessment Neighborhood')
    df_temp[mean_med+'_prop_ec'].plot(kind='barh',ax=axes[1],color=list(df_temp['aval_col'].values),title='Assessment Value Growth')
    axes[1].set_ylabel('')
    

growth_compare(1,'mean')
plt.suptitle('Average Annual Growth During Era #1 (2001-2004)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar1_nbhd.png')

growth_compare(1,'50%')
plt.suptitle('Median Annual Growth During Era #1 (2001-2004)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar1_nbhd.png')

Interesting. Early on there appeared to be little correlation between the growth rates in AGI and the growth rates in assessment value. It does appear, however, that there is some correlation between starting mean AGI and growth rate. Darker colors indicate higher quintiles.

Similar stories seem to arise in the median data as well. There are some curious deviations however. Neighborhoods like Cleveland Park, Georgetown, and Foggy Bottom experienced marginal growth in median AGI despite material increases in mean AGI. In contrast, Washington Navy Yard and DC Village saw dramatic increases in median income without corresponding increases mean income. Clearly the shape of the income profile has changed dramatically in some neighborhoods. Assessment values, however, appear to be characterized by more even growth.

How does the picture look in the second era?

In [20]:
growth_compare(2,'mean')
plt.suptitle('Average Annual Growth During Era #2 (2005-2007)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar2_nbhd.png')

growth_compare(2,'50%')
plt.suptitle('Median Annual Growth During Era #2 (2005-2007)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar2_nbhd.png')

The second era is characterized by a substantial decline in variation, for both AGI and assessed value. Furthermore, of those experiencing positive growth in AGI, the magnitude of said growth decline. Assessed values, on the other hand, experienced virtually positive growth for all neighborhoods. The correlation between AGI growth and assessed value growth still appears to be weak at best.

Comparing the mean and median plots reveals distributions that, by and large, became more skewed towards higher incomes and assessment values during this period (as evidenced by mean growth outpacing median growth in most neighborhoods). This is perhaps unsurprising, fitting a national narrative about capital wealth appreciation in the lead up to the Recession.

What about the Great Recession and recovery era?

In [21]:
growth_compare(3,'mean')
plt.suptitle('Average Annual Growth During Era #3 (2008-2011)',fontsize=20)
plt.savefig(img_dir+'AvgGrowthBar3_nbhd.png')

growth_compare(3,'50%')
plt.suptitle('Median Annual Growth During Era #3 (2008-2011)',fontsize=20)
plt.savefig(img_dir+'MedianGrowthBar3_nbhd.png')

Growth all but vanishes in this time period, and the correlation between higher starting AGI and higher growth seems to have dissipated as well. Furthermore, aside from a sharp downward skew in Massachusetts Avenue Heights, the story is consistent across mean and median data.

Mapping these data would aid in identifying clustered growth activity. We can map both AGI and assessed value by era.

In [22]:
#Set plot dimensions
plt.rcParams['figure.figsize']=18,24

#Generate plotting object
fig,axes=plt.subplots(len(era_dict2),2)

#For each era...
for i in era_dict2:
    #...plot the distribution of mean_iit growth...
    gp.GeoDataFrame(era_dict2[i]).plot(column='mean_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i-1,0])
    axes[i-1,0].set_title('Average Growth in Average AGI - Era #'+str(i))
    axes[i-1,0].patch.set_facecolor('white')
    axes[i-1,0].set_axis_off()
    axes[i-1,0].set_aspect(1)
    #...and the distribution of mean_prop growth
    gp.GeoDataFrame(era_dict2[i]).plot(column='mean_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i-1,1])
    axes[i-1,1].set_title('Average Growth in Average Assessment Value - Era #'+str(i))
    axes[i-1,1].patch.set_facecolor('white')
    axes[i-1,1].set_axis_off()
    axes[i-1,1].set_aspect(1)

plt.tight_layout()
plt.savefig(img_dir+'AvgGrowthMap_era_nbhd.png')

Note that the colors here do not correspond to the same groups defined above (by mean AGI quintiles in 2001). Rather, they are quintile ranges measured with the data from each era's GDF. This is the primary reason for the shift in color.

Relative income growth continues to shift from dense concentrations in the affluent northwest to pockets located in the center of the District. The eastward shift in property growth is far more dramatic. The highest growth values occur largely in the residential districts to the east of Brookland in later years.

One might wonder whether or not the median data corroborate this story.

In [23]:
#Generate plotting object
fig,axes=plt.subplots(len(era_dict2),2)

#For each era...
for i in era_dict2:
    #...plot the distribution of mean_iit growth...
    gp.GeoDataFrame(era_dict2[i]).plot(column='50%_iit', scheme='QUANTILES', k=5, colormap='Blues',axes=axes[i-1,0])
    axes[i-1,0].set_title('Average Growth in Median AGI - Era #'+str(i))
    axes[i-1,0].patch.set_facecolor('white')
    axes[i-1,0].set_axis_off()
    axes[i-1,0].set_aspect(1)
    #...and the distribution of mean_prop growth
    gp.GeoDataFrame(era_dict2[i]).plot(column='50%_prop', scheme='QUANTILES', k=5, colormap='Reds',axes=axes[i-1,1])
    axes[i-1,1].set_title('Average Growth in Median Assessment Value - Era #'+str(i))
    axes[i-1,1].patch.set_facecolor('white')
    axes[i-1,1].set_axis_off()
    axes[i-1,1].set_aspect(1)

plt.tight_layout()
plt.savefig(img_dir+'MedianGrowthMap_era_nbhd.png')

Indeed they do, but the eastward trend appears more pronounced on the income side, while the median assessment value growth activity has a distinctly southern flavor.

Are Investment Neighborhoods Different?

In many ways, the above analysis provides empirical support to a narrative familiar to ORA staff. The income distribution in DC is shifting spatially. New real estate opportunities are being exploited by young professionals in historically less developed neighborhoods, which disproportionately reside on the eastern side of the District. We are seeing income and property value movement that is largely consistent with a shift in the locus of economic activity.

While this is generally interesting, the motivation for this exploration is the analysis of the impact of transformative investment. It would thus be useful to explore the correlation between public investment and growth in our indicators. To be specific, we need to be able to directly compare levels of public investment and growth in AGI and assessment value. We already have the growth values, but we lost our levels during the procedure used to calculate the growth values. Consequently, we need to calculate average investment for each neighborhood in each era. To preserve the ability to compare past investment to future growth, we will capture all of the era investment averages in one DF and then merge them with the growth indicators from each of the three eras.

In [24]:
#Create copy of stats, but capturing levels this time
inv_era=stats.reset_index().copy()

#Recode years to eras
inv_era['era']=inv_era['year'].map(era_map_dict)

#Group investment by neighborhood and era, aggregating by mean
i_era=inv_era['exp'][inv_era['era'].notnull()].groupby([inv_era['aneigh'],inv_era['era']]).mean().fillna(0)

#Group business licenses by neighborhood and era, aggregating by sum
bbl_era=inv_era['num_bbl'][inv_era['era'].notnull()].groupby([inv_era['aneigh'],inv_era['era']]).sum().fillna(0)

#Combine investment and BBL data
inv_bbl_era=DataFrame(i_era).rename(columns={0:'inv_era'}).join(DataFrame(bbl_era).rename(columns={0:'bbl_era'}))

#Unstack to hold all eras with a single neighborhood reference
ib_era=inv_bbl_era.unstack(level='era')

#Rename columns
ib_era.columns=['inv1','inv2','inv3','bbl1','bbl2','bbl3']

ib_era.head(20)
Out[24]:
inv1 inv2 inv3 bbl1 bbl2 bbl3
aneigh
16th Street Heights 0.00 0.000000 0.0 94 218 439
American University 0.00 0.000000 0.0 88 140 352
Anacostia 0.00 0.000000 0.0 280 264 815
Anacostia Park 0.00 0.000000 0.0 0 0 2
Barry Farms 0.00 0.000000 0.0 33 259 239
Berkley 0.00 0.000000 0.0 4 21 39
Bolling Air Force Base 0.00 0.000000 0.0 7 15 5
Brentwood 0.00 0.000000 1200000.0 171 363 649
Brightwood 0.00 0.000000 0.0 148 321 654
Brookland 0.00 0.000000 0.0 283 565 1210
Burleith 0.00 0.000000 0.0 10 117 112
Capitol Hill 0.00 0.000000 0.0 99 405 771
Central-tri 1 0.00 0.000000 0.0 612 846 2747
Central-tri 3 18496552.75 1044743.666667 5246265.5 334 970 2390
Chevy Chase 0.00 0.000000 0.0 89 168 466
Chillum 0.00 0.000000 0.0 69 101 222
Cleveland Park 0.00 0.000000 0.0 21 177 557
Colonial Village 0.00 0.000000 0.0 2 5 36
Columbia Heights 513868.50 11223061.000000 11797136.5 372 874 1854
Congress Heights 0.00 0.000000 0.0 178 568 1182

20 rows × 6 columns

With the investment and BBL information compiled by neighborhood and era, we can merge it back in with the existing data.

In [25]:
#Create container for new GDFs
era_gdf_list2=[]

#Create container for era labels
era_list2=[]

#For each era...
for era in era_dict2.keys():
    #...capture the era...
    era_list2.append(era)
    #...and capture a new era-specific GDF...
    era_gdf_list2.append(gp.GeoDataFrame(era_dict2[era].join(ib_era)))

#Capture new era-specific GDFs in a dictionary
era_dict3=dict(zip(era_list2,era_gdf_list2))

era_dict3[3].head().T
Out[25]:
aneigh 16th Street Heights American University Anacostia Anacostia Park Barry Farms
DESCRIPTIO 049 16th Street Heights 001 American University 002 Anacostia 064 Anacostia Park 003 Barry Farms
GIS_ID nhood_049 nhood_001 nhood_002 nhood_064 nhood_003
LABELB None None None None None
NBHD 049 001 002 064 003
NBHD_NUM 49 1 2 64 3
NBHD_TEXT 49 1 2 64 3
NHDLNK 049 001 002 064 003
OBJECTID 19 4 5 59 6
SHAPE_AREA 0 0 0 0 0
SHAPE_LEN 0 0 0 0 0
geometry POLYGON ((397753.9375005662441254 141723.29690... POLYGON ((393362.3124991357326508 141606.35940... POLYGON ((402129.8438016176223755 134197.10940... POLYGON ((402293.9999981075525284 134723.28129... POLYGON ((399873.1874952167272568 133147.24999...
exp_tot NaN NaN NaN NaN NaN
year 2009.5 2009.5 2009.5 2009 2009.5
mean_iit 0.001796586 0.02342792 0.02161648 NaN 0.04728481
mean_prop 0.02011667 0.02658858 0.08858267 0.06899296 0.05361671
exp NaN NaN NaN NaN NaN
num_bbl 0.3977771 0.2475023 0.9001058 -0.9565217 -0.105701
50%_iit 0.003418689 0.02206532 0.01732333 NaN 0.0269894
50%_prop 0.003246311 0.007945699 0.0633747 0.008518167 0.07127914
era 2 2 2 2 2
q_start q3 q4 q1 NaN q1
colors #74C476 #31A354 #EDF8E9 NaN #EDF8E9
agi_col (0.706343731459, 0.829019618034, 0.91271050116... (0.326289898858, 0.618623629037, 0.80279893524... (0.778685134299, 0.860299892987, 0.93799308328... NaN (0.790495975579, 0.868173787173, 0.94193003037...
aval_col (0.988235294819, 0.661453307376, 0.54897348530... (0.957001155965, 0.308712036586, 0.22191465613... (0.988419839214, 0.736747420535, 0.63589390866... NaN (0.989404075987, 0.754955800842, 0.66000770961...
mean_iit_ec 0.001796586 0.02342792 0.02161648 0 0.04728481
50%_iit_ec 0.003418689 0.02206532 0.01732333 0 0.0269894
mean_prop_ec 0.02011667 0.02658858 0.08858267 0.06899296 0.05361671
50%_prop_ec 0.003246311 0.007945699 0.0633747 0.008518167 0.07127914
inv1 0 0 0 0 0
inv2 0 0 0 0 0
inv3 0 0 0 0 0
bbl1 94 88 280 0 33
bbl2 218 140 264 0 259
bbl3 439 352 815 2 239

34 rows × 5 columns

We can now compare investment in the three eras. Colorized correlation plots may be helpful.

Note that while these plots will be for primary use, there are several more detailed scatter plots and correlation matrices in the Appendix. For each era, we will evaluate the correlation between our responses (AGI and assessment value growth) and investment that occurs in the current and prior eras. Thus, for the first era responses, scatters will evaluate only first era investment. For the second era responses, scatters evaluate both the second and first era investments. In similar fashion, the third era responses are plotted against all three.

Getting back to the plots that immediately follow, unfortunately, Python sometimes makes very simple things difficult. In this case, while plotting the correlation matrix as a psuedocolors is relatively straightforward, centering diverging colors on zero is not. Why do we want to center on zero? It aids in interpretation if one color is always positive and one is always negative.

To make this happen in Python, we always have the option of explicitly defining color ramps for our particular data set. A more robust option is to define a new subclass that inherits from matplotlib.colors.Normalize. I guess I had to dive into custom classes at some point...

In [26]:
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        # I'm ignoring masked values and all kinds of edge cases to make a
        # simple example...
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

This subclass option is the only method that allows for continuous color mapping, something we are more often interested in than not.

In [27]:
#Define desired columns
cor_cols1=['mean_iit','mean_prop','50%_iit','50%_prop','inv1']

#Define first era subset
e1_sub=era_dict3[1][cor_cols1].dropna()

#Generate correlation matrix (transposing captures order provided in the DF)
e1_corr=np.corrcoef(e1_sub.T)

#Set plot size
plt.rcParams['figure.figsize']=13,10

#Construct subplot
fig,ax=plt.subplots()

#Define midpoint object
norm=MidpointNormalize(midpoint=0)

#Generate plot object
e1c=ax.imshow(e1_corr,norm=norm,cmap='RdBu',interpolation='none')

#Add color bar
fig.colorbar(e1c)

#Set tick positions
labelPos=np.arange(0,len(e1_sub.columns))

#Set tick labels
pylab.xticks(labelPos,e1_sub.columns)
pylab.yticks(labelPos,e1_sub.columns)
pylab.title('Correlation Plot in Era #1 (2001-2004)');
In [28]:
#Define desired columns
cor_cols2=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2']

#Define second era subset
e2_sub=era_dict3[2][cor_cols2].dropna().replace(inf,0)

#Generate correlation matrix (transposing captures order provided in the DF)
e2_corr=np.corrcoef(e2_sub.T)

#Set plot size
plt.rcParams['figure.figsize']=13,10

#Construct subplot
fig,ax=plt.subplots()

#Generate plot object
e2c=ax.imshow(e2_corr,norm=norm,cmap='RdBu',interpolation='none')

#Add color bar
fig.colorbar(e2c)

#Set tick positions
labelPos2=np.arange(0,len(e2_sub.columns))

#Set tick labels
pylab.xticks(labelPos2,e2_sub.columns)
pylab.yticks(labelPos2,e2_sub.columns)
pylab.title('Correlation Plot in Era #2 (2005-2007)')
Out[28]:
<matplotlib.text.Text at 0xd000f10>
In [29]:
#Define desired columns
cor_cols3=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2','inv3']

#Define third era subset
e3_sub=era_dict3[3][cor_cols3].dropna()

#Generate correlation matrix (transposing captures order provided in the DF)
e3_corr=np.corrcoef(e3_sub.T)

#Set plot size
plt.rcParams['figure.figsize']=13,10

#Construct subplot
fig,ax=plt.subplots()

#Generate plot object
e3c=ax.imshow(e3_corr,norm=norm,cmap='RdBu',interpolation='none')

#Add color bar
fig.colorbar(e3c)

#Set tick positions
labelPos3=np.arange(0,len(e3_sub.columns))

#Set tick labels
pylab.xticks(labelPos3,e3_sub.columns)
pylab.yticks(labelPos3,e3_sub.columns)
pylab.title('Correlation Plot in Era #3 (2008-2011)')
Out[29]:
<matplotlib.text.Text at 0x1199b050>

Comparison of the correlation plots over the three periods reveals remarkably different behavior across time periods. It suggests that the association between the variables presented here is conditional on outside factors. This is most noticeable in the third era, with the general level of positive correlation increasing markedly. There are also a few particular items of note:

  1. In the first era, mean and median assessment value growth is highly correlated. While mean and median income is also positively correlated, it is to a much lesser extent.
  2. In the second era, mean and median income are correlated in a way similar to the first era. In contrast, mean and median assessment value are now moving in opposite directions.
  3. In the first two eras, investment levels are weakly or outright negatively correlated with growth in the income and property tax bases. This picture changes in the third era to a weak or somewhat postivie correlation.
  4. The most significant investment correlation appears in the third era. Average assessment value growth is moderately correlated with investment in the previous time period.

At this point, we can look at average growth rates in investment neighborhoods relative to the other neighborhoods in each era. For our purposes, we will view the existence of investment as a binary treatment. Any non-zero investment level neighborhoods go into the treatment group, while all remaining neighborhoods constitute the control. The first step is parsing out these groups for each era.

In [30]:
#Establish treatment groups by era
e1_treat=era_dict3[1]['inv1'][era_dict3[1]['inv1']>0].index
e2_treat=era_dict3[2]['inv2'][era_dict3[2]['inv2']>0].index
e3_treat=era_dict3[3]['inv3'][era_dict3[3]['inv3']>0].index

#Establish control groups by era
e1_control=era_dict3[1]['inv1'][era_dict3[1]['inv1']==0].index
e2_control=era_dict3[2]['inv2'][era_dict3[2]['inv2']==0].index
e3_control=era_dict3[3]['inv3'][era_dict3[3]['inv3']==0].index

For each of these groups, we need to compare the average growth rates within each era.

In [31]:
#Capture desired columns
comp_cols=['mean_iit','mean_prop','50%_iit','50%_prop']

#Capture average growth rates for treatment and control groups in a single DF
e1_comp=DataFrame(era_dict3[1].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[1].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e2_comp=DataFrame(era_dict3[2].ix[e2_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[2].ix[e2_control][comp_cols].mean())).rename(columns={0:'control'})
e3_comp=DataFrame(era_dict3[3].ix[e3_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[3].ix[e3_control][comp_cols].mean())).rename(columns={0:'control'})

#Plot comparisons for each era
plt.rcParams['figure.figsize']=15,20
fig,axes=plt.subplots(3,sharex=True)
e1_comp.plot(kind='bar',ax=axes[0],title='Growth Rate Comparisons - First Era',color=['#ef8a62','#999999'])
axes[0].patch.set_facecolor('white')
e2_comp.plot(kind='bar',ax=axes[1],title='Growth Rate Comparisons - Second Era',color=['#ef8a62','#999999'])
axes[1].patch.set_facecolor('white')
e3_comp.plot(kind='bar',ax=axes[2],title='Growth Rate Comparisons - Third Era',color=['#ef8a62','#999999'])
axes[2].patch.set_facecolor('white')
axes[2].set_xticklabels(['Average AGI','Average Assessment Value','Median AGI','Median Assessment Value'])

#Set rotation of tick labels
plt.xticks(rotation=0)

#Fix spacing
plt.tight_layout()


plt.savefig(img_dir+'GrwthCompare.png')

Wow, that's interesting. For almost every comparison, the control group grew faster. This smacks of selection bias in the identification of projects. That guy is fighting with spatial resolution to become public enemy #1.

It would be useful to capture the same plots for lagged investment scenarios:

  1. Era 2 data with Era 1 groups;
  2. Era 3 data with Era 1 groups; and,
  3. Era 3 data with Era 2 groups.

These plots should allow us to see distancing that may occur after the investment has had a few years to take an effect.

In [32]:
#Capture average growth rates for treatment and control groups in a single DF
e2L1_comp=DataFrame(era_dict3[2].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[2].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e3L1_comp=DataFrame(era_dict3[3].ix[e1_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[3].ix[e1_control][comp_cols].mean())).rename(columns={0:'control'})
e3L2_comp=DataFrame(era_dict3[3].ix[e2_treat][comp_cols].mean()).rename(columns={0:'treatment'}).\
            join(DataFrame(era_dict3[3].ix[e2_control][comp_cols].mean())).rename(columns={0:'control'})

#Plot comparisons for each era
plt.rcParams['figure.figsize']=15,20
fig,axes=plt.subplots(3,sharex=True)
e2L1_comp.plot(kind='bar',ax=axes[0],title='Lagged Growth Rate Comparisons - First Era Groups & Second Era Data',\
               color=['#ef8a62','#999999'])
axes[0].patch.set_facecolor('white')
e3L1_comp.plot(kind='bar',ax=axes[1],title='Lagged Growth Rate Comparisons - First Era Groups & Third Era Data',\
               color=['#ef8a62','#999999'])
axes[1].patch.set_facecolor('white')
e3L2_comp.plot(kind='bar',ax=axes[2],title='Lagged Growth Rate Comparisons - Second Era Groups & Third Era Data',\
               color=['#ef8a62','#999999'])
axes[2].patch.set_facecolor('white')
axes[2].set_xticklabels(['Average AGI','Average Assessment Value','Median AGI','Median Assessment Value'])

#Set rotation of tick labels
plt.xticks(rotation=0)

#Fix spacing
plt.tight_layout()

plt.savefig(img_dir+'LagGrwthCompare.png')

It appears that in the second era (2005-2007), the treatment group neighborhoods did make some progress relative to the control groups. This was, however, quickly and substantially, reversed in the third time period.

Appendix

Well, this supports the notion that early detection of tax base response to investment is a less than straightforward exercise. Let's take a look at some scatters.

Here is the first era (2001-2004).

In [58]:
#Define columns to insert into the matrix
cor_cols=['mean_iit','mean_prop','50%_iit','50%_prop','inv1','inv2','inv3']

scatter_matrix(era_dict3[1][cor_cols],alpha=.4,figsize=(20,20),diagonal='kde');

The second era (2005-2007)...

In [59]:
scatter_matrix(era_dict3[2][cor_cols].dropna().replace(inf,0),alpha=.4,figsize=(20,20),diagonal='kde');

...and the third era (2008-2011)

In [60]:
scatter_matrix(era_dict3[3][cor_cols],alpha=.8,figsize=(20,20),diagonal='kde');